Binary (Logistic) Regression with Three Predictors
To illustrate a regression model with three predictor variables, we will develop a logistic regression model for obesity as a function of gender, race and age for the US adult (age 18-79) population.
Gender is a binary variable with 2 levels (male and female), race is a categorical variable with 5 levels (White, Black, Mexican, Hispanic, Other), and age is a quantitative variable.
We will model the effect of age using natural cubic splines. In formulating our initial model, we need to specify the number of parameters (df) (complexity or degree of nonlinearity) for the age effect. In most situations, we will default to using a natural cubic spline with 4 df, whihc typically represents an appropriate balance between flexibility and loss of precision caused by overfitting. Small samples (\(N < 30\)) may require the use of 2 df, and moderate samples (\(N < 100\)) may require the use of 3 df. Empirical evidence suggests that more than 4 df are seldom required in a natural cubic spline regression model. Natural cubic spline models with 5 or 6 df should only be considered if the sample size is very large. Based on these considerations, since the effective sample size is fairly large, we will model the effect of age using a natural cubic spline with 5 df. (Using different df for the age effect and for the race effect is also helpful for the clarity of the presentation.)
The choice between a model with additive effects of all three predictor variables or a model with one or more interaction terms should depend primarily on the scientific context and, to a large extent, on the available sample size. The number of parameters \(p\) (regression coefficients) that can be reliably estimated is \(m/15\), where \(m\) is the effective sample size (for logistic regression, \(m\) is the minimum of the number of events and the number of non-events in our sample).
Possible Models
We denote the obesity outcome for subject \(i\) by \(y_i\), the gender for subject \(i\) by \(x_{1i}\), the race for subject \(i\) by \(x_{2i}\), and the age for subject \(i\) by \(x_{3i}\).
A logistic regression model for (prevalence of) obesity as a function of gender, race and age including all possible interaction terms is \[ \begin{eqnarray*} y_i & \sim & \mbox{Bernoulli}\left\{p_i\right\} \\ \mbox{logit}\left\{p_i\right\} & = & \alpha_0 + f_{g}(x_{1i},{\bf \beta}) + f_{r}(x_{2i},{\bf \gamma}) + f_{a}(x_{3i},{\bf \delta}) + \\ & & f_{g,r}(x_{1i},x_{2i},{\bf \beta\gamma}) + f_{g,a}(x_{1i},x_{3i},{\bf \beta\delta}) + f_{r,a}(x_{2i},x_{3i},{\bf \gamma\delta}) + \\ & & f_{g,r,a}(x_{1i},x_{2i},x_{3i},{\bf \beta\gamma\delta}) \\ \end{eqnarray*} \]
For modeling, gender must be coded with \(1~(=2-1)\) numeric variable, and race must be coded with \(4~(=5-1)\) numeric variables, and age will be represented by 5 variables, the basis for the natural cubic spline with 5 df or 6 knots (more generally, \(k\) variables, the basis for the natural cubic spline with \(k\) df or \(k+1\) knots).
The two-way interaction between gender and race is encoded using the \(4 = 1 \times 4\) products of the \(1\) variable used to encode gender and the \(4\) variables used to encode race. The two-way interaction between gender and age is encoded using the \(5 = 1 \times 5\) products of the \(1\) variable used to encode gender and the \(5\) variables used to encode age. The two-way interaction between race and age is encoded using the \(20 = 4 \times 5\) products of the \(4\) variables used to encode race and the \(5\) variables used to encode age.
The three-way interaction between gender, race and age is encoded using the \(20 = 1 \times 4 \times 5\) products of the \(1\) variable used to encode gender, the \(4\) variables used to encode race and the \(5\) variables used to encode age.
The model with all of these terms uses \[59 = 2 \times 5 \times 6 - 1 = 1 + 4 + 5 + 4 + 5 + 20 + 20\] df (regression parameters excluding the intercept).
There are seven simplifications of this model that could be of substantive interest: 1. an additive model for gender, race and age with \(10 = 1 + 4 + 5\) df \[ \begin{eqnarray*} \mbox{logit}\left\{p_i\right\} & = & \alpha_0 + f_{g}(x_{1i},{\bf \beta}) + f_{r}(x_{2i},{\bf \gamma}) + f_{a}(x_{3i},{\bf \delta}) \\ \end{eqnarray*} \] * the effect of gender does not vary by race and/or age * the effect of race does not vary by gender and/or age * the effect of age does not vary by gender and/or race 2. a two-way interaction between gender and race plus an additive effect of age with \(14 = 1 + 4 + 5 + 4\) df \[ \begin{eqnarray*} \mbox{logit}\left\{p_i\right\} & = & \alpha_0 + f_{g}(x_{1i},{\bf \beta}) + f_{r}(x_{2i},{\bf \gamma}) + f_{g,r}(x_{1i},x_{2i},{\bf \beta\gamma}) + f_{a}(x_{3i},{\bf \delta}) \\ \end{eqnarray*} \] * the effect of gender varies by race, but not by age * the effect of race varies by gender, but not by age * the effect of age does not vary by gender and/or race 3. a two-way interaction between gender and age plus an additive effect of race with \(15 = 1 + 4 + 5 + 5\) df \[ \begin{eqnarray*} \mbox{logit}\left\{p_i\right\} & = & \alpha_0 + f_{g}(x_{1i},{\bf \beta}) + f_{r}(x_{2i},{\bf \gamma}) + f_{a}(x_{3i},{\bf \delta}) + f_{g,a}(x_{1i},x_{3i},{\bf \beta\delta}) \\ \end{eqnarray*} \] * the effect of gender varies by age, but not by race * the effect of race does not vary by gender and/or age * the effect of age varies by gender, but not by race 4. a two-way interaction between age and race plus an additive effect of gender with \(30 = 1 + 4 + 5 + 20\) df \[ \begin{eqnarray*} \mbox{logit}\left\{p_i\right\} & = & \alpha_0 + f_{g}(x_{1i},{\bf \beta}) + f_{r}(x_{2i},{\bf \gamma}) + f_{a}(x_{3i},{\bf \delta}) + f_{r,a}(x_{2i},x_{3i},{\bf \gamma\delta}) \\ \end{eqnarray*} \] * the effect of gender does not vary by race and/or age * the effect of race varies by age, but not by gender * the effect of age varies by race, but not by gender 5. a two-way interaction between gender and race plus a two-way interaction between gender and age with \(19 = 1 + 4 + 5 + 4 + 5\) df \[ \begin{eqnarray*} \mbox{logit}\left\{p_i\right\} & = & \alpha_0 + f_{g}(x_{1i},{\bf \beta}) + f_{r}(x_{2i},{\bf \gamma}) + f_{a}(x_{3i},{\bf \delta}) + \\ & & f_{g,r}(x_{1i},x_{2i},{\bf \beta\gamma}) + f_{g,a}(x_{1i},x_{3i},{\bf \beta\delta}) \\ \end{eqnarray*} \] * the effect of gender varies by race and age * the effect of race varies by gender, but not by age * the effect of age varies by gender, but not by race 6. a two-way interaction between race and gender plus a two-way interaction between race and age with \(34 = 1 + 4 + 5 + 4 + 20\) df \[ \begin{eqnarray*} \mbox{logit}\left\{p_i\right\} & = & \alpha_0 + f_{g}(x_{1i},{\bf \beta}) + f_{r}(x_{2i},{\bf \gamma}) + f_{a}(x_{3i},{\bf \delta}) + \\ & & f_{g,r}(x_{1i},x_{2i},{\bf \beta\gamma}) + f_{r,a}(x_{2i},x_{3i},{\bf \gamma\delta}) \\ \end{eqnarray*} \] * the effect of gender varies by race, but not by age * the effect of race varies by gender and age * the effect of age varies by race, but not by gender 7. a two-way interaction between gender and age plus a two-way interaction between race and age with \(35 = 1 + 4 + 5 + 5 + 20\) df \[ \begin{eqnarray*} \mbox{logit}\left\{p_i\right\} & = & \alpha_0 + f_{g}(x_{1i},{\bf \beta}) + f_{r}(x_{2i},{\bf \gamma}) + f_{a}(x_{3i},{\bf \delta}) + \\ & & f_{g,a}(x_{1i},x_{3i},{\bf \beta\delta}) + f_{r,a}(x_{2i},x_{3i},{\bf \gamma\delta}) \\ \end{eqnarray*} \] * the effect of gender varies by age, but not by race * the effect of race varies by age, but not by gender * the effect of age varies by gender and race
The usefulness of these potential simplifications largely depends on the goals of our study. For example, if racial comparisons (the effects of race) are of primary interest, model 3 (the effect of race does not vary by gender and/or age), model 5 (the effect of race varies by gender, but not by age), and model 7 (the effect of race varies by age, but not by gender) represent meaningful simplifications of the full model with a three-way interaction, while the other models are not.
Initial Model: Three-Way Interaction Between Gender, Race and Age
Moving forward, we will assume our primary “exposure” variable of interest is race; gender and age are potential confounders and/or effect modifiers. Given our large effective sample size \((m=2592)\), we will allow for a potential three-way interaction between race, gender and age.
LIBNAME NHANES "../../../BMI552_Datasets/NHANES_May2023/";
*LIBNAME NHANES "~/my_shared_file_links/u48718377/NHANES";
OPTIONS FMTSEARCH=(NHANES.formats_NHANES work) NODATE NONUMBER;
TITLE1; TITLE2;
ODS NOPROCTITLE;
ODS GRAPHICS / LOESSMAXOBS=20000;
DATA Ron;
SET NHANES.NHANES (RENAME=(Race1=Race));
RUN;
ODS SELECT NONE;
PROC LOGISTIC DATA=Ron;
ODS SELECT ResponseProfile;
* ODS SELECT ClassLevelInfo;
* ODS SELECT SplineKnots;
CLASS Gender Race / ORDER=INTERNAL;
EFFECT AgeCS=SPLINE(Age /
NATURALCUBIC BASIS=TPF(NOINT) KNOTMETHOD=PERCENTILELIST(5 23 41 59 77 95));
MODEL Obese(EVENT="Yes") = Gender Race AgeCS
Gender*Race Gender*AgeCS Race*AgeCS
Gender*Race*AgeCS / LINK=LOGIT;
OUTPUT OUT=Tmp RESCHI=Residual PREDICTED=Fitted;
STORE GenderRaceAge;
RUN;
ODS SELECT ALL;
| Response Profile | ||
|---|---|---|
|
Ordered Value |
Obese |
Total Frequency |
| 1 | No | 4643 |
| 2 | Yes | 2592 |
Probability modeled is Obese='Yes'.
Assessing Adequacy of Modeling Assumptions
Before using the model, we must evaluate the adequacy of the modeling assumptions. For logistic regression models, the only assumptions are correct structural model and independence (which cannot be evaluated based on the observed data). The only concern is the potential for inadequacies in the representation of the effect of age, e.g. insufficient df. We can evaluate this assumption using plots of the residuals against age by gender and/or race.
PROC SGPANEL DATA=Tmp;
PANELBY Gender / LAYOUT=ROWLATTICE;
REFLINE 0 / AXIS=Y;
* PBSPLINE X=Age Y=Residual / JITTER; *large samples;
LOESS X=Age Y=Residual / JITTER; *small smaples;
RUN;
PROC SGPANEL DATA=Tmp;
PANELBY Race / LAYOUT=ROWLATTICE;
REFLINE 0 / AXIS=Y;
* PBSPLINE X=Age Y=Residual / JITTER; *large samples;
LOESS X=Age Y=Residual / JITTER; *small smaples;
RUN;
PROC SGPANEL DATA=Tmp;
PANELBY Race Gender;
REFLINE 0 / AXIS=Y;
* PBSPLINE X=Age Y=Residual / JITTER; *large samples;
LOESS X=Age Y=Residual / JITTER; *small smaples;
RUN;
The lack of any clear patterns in any of these plots suggests that the correct structural model assumption is valid.
Alternatively, we can use AIC or BIC to formally assess the adequacy of the model fit.
ODS EXCLUDE ALL;
PROC LOGISTIC DATA=Ron;
ODS OUTPUT GlobalTests=DF5;
CLASS Gender Race / ORDER=INTERNAL;
EFFECT AgeCS=SPLINE(Age /
NATURALCUBIC BASIS=TPF(NOINT) KNOTMETHOD=PERCENTILELIST(5 23 41 59 77 95));
MODEL Obese(EVENT="Yes") = Gender Race AgeCS
Gender*Race Gender*AgeCS Race*AgeCS
Gender*Race*AgeCS / LINK=LOGIT;
RUN;
PROC LOGISTIC DATA=Ron;
ODS OUTPUT GlobalTests=DF6;
CLASS Gender Race / ORDER=INTERNAL;
EFFECT AgeCS=SPLINE(Age /
NATURALCUBIC BASIS=TPF(NOINT) KNOTMETHOD=PERCENTILELIST(2.5 18.3333 34.1667 50 65.8333 81.6667 97.5));
MODEL Obese(EVENT="Yes") = Gender Race AgeCS
Gender*Race Gender*AgeCS Race*AgeCS
Gender*Race*AgeCS / LINK=LOGIT;
RUN;
DATA DF5;
SET DF5;
WHERE Test="Likelihood Ratio";
Model = "Age 5df";
AIC = ChiSq - 2*(DF+1);
BIC = ChiSq - LOG(2592)*(DF+1);
KEEP Model DF AIC BIC;
RUN;
DATA DF6;
SET DF6;
WHERE Test="Likelihood Ratio";
Model = "Age 6df";
AIC = ChiSq - 2*(DF+1);
BIC = ChiSq - LOG(2592)*(DF+1);
KEEP Model DF AIC BIC;
RUN;
DATA ModelComparisons;
LENGTH Model $20;
SET DF5 DF6;
KEEP Model DF AIC BIC;
RUN;
PROC MEANS DATA=ModelComparisons MAX NOPRINT;
VAR AIC BIC;
OUTPUT OUT=MaxIC MAX= / AUTONAME;
RUN;
DATA ModelComparisons;
IF _N_=1 THEN SET MaxIC;
SET ModelComparisons;
AIC = AIC_Max - AIC;
BIC = BIC_Max - BIC;
KEEP Model DF AIC BIC;
RUN;
ODS EXCLUDE NONE;
TITLE "Model Comparisons (AIC/BIC)";
PROC PRINT DATA=ModelComparisons NOOBS;
VAR Model DF AIC BIC;
FORMAT AIC BIC 6.1;
RUN;
TITLE;
| Model | DF | AIC | BIC |
|---|---|---|---|
| Age 5df | 59 | 0.0 | 0.0 |
| Age 6df | 69 | 4.7 | 63.3 |
Based on BIC (my preference for considering expansions of the model), there is conclusive evidence for the model using 5 df for the age effect.
Odds Ratios for Race by Gender and Age
ESTIMATE statements can be used to find comparisons of any two races for any combination of gender and age.
ODS EXCLUDE ALL;
PROC PLM RESTORE=GenderRaceAge;
ODS OUTPUT Estimates=E;
ESTIMATE "Mexican v Hispanic, Female 30" Race [1, 3] [-1, 4]
Gender*Race [1, 2 3] [-1, 2 4]
Race*AgeCS [1, 3 30] [-1, 4 30]
Gender*Race*AgeCS [1, 2 3 30] [-1, 2 4 30] / EXP CL ALPHA=0.03125;
ESTIMATE "Mexican v Hispanic, Male 30" Race [1, 3] [-1, 4]
Gender*Race [1, 1 3] [-1, 1 4]
Race*AgeCS [1, 3 30] [-1, 4 30]
Gender*Race*AgeCS [1, 1 3 30] [-1, 1 4 30] / EXP CL ALPHA=0.03125;
ESTIMATE "Mexican v Hispanic, Female 50" Race [1, 3] [-1, 4]
Gender*Race [1, 2 3] [-1, 2 4]
Race*AgeCS [1, 3 50] [-1, 4 50]
Gender*Race*AgeCS [1, 2 3 50] [-1, 2 4 50] / EXP CL ALPHA=0.03125;
ESTIMATE "Mexican v Hispanic, Male 50" Race [1, 3] [-1, 4]
Gender*Race [1, 1 3] [-1, 1 4]
Race*AgeCS [1, 3 50] [-1, 4 50]
Gender*Race*AgeCS [1, 1 3 50] [-1, 1 4 50] / EXP CL ALPHA=0.03125;
ESTIMATE "Mexican v Hispanic, Female 70" Race [1, 3] [-1, 4]
Gender*Race [1, 2 3] [-1, 2 4]
Race*AgeCS [1, 3 70] [-1, 4 70]
Gender*Race*AgeCS [1, 2 3 70] [-1, 2 4 70] / EXP CL ALPHA=0.03125;
ESTIMATE "Mexican v Hispanic, Male 70" Race [1, 3] [-1, 4]
Gender*Race [1, 1 3] [-1, 1 4]
Race*AgeCS [1, 3 70] [-1, 4 70]
Gender*Race*AgeCS [1, 1 3 70] [-1, 1 4 70] / EXP CL ALPHA=0.03125;
RUN;
ODS EXCLUDE NONE;
PROC PRINT DATA=E NOOBS LABEL;
VAR Label LowerExp ExpEstimate UpperExp;
LABEL Label='00'x
LowerExp="Lower CL"
ExpEstimate="Estimate"
UpperExp="Upper CL";
TITLE "Odds Ratios for Race by Gender and Age";
RUN;
PROC SGPLOT DATA=E NOAUTOLEGEND;
STYLEATTRS DATACOLORS=(Cx1b9e77 Cxd95f02 Cx7570b3 Cxe7298a Cx66a61e Cxe6ab02 Cxa6761d Cx666666);
SCATTER Y=Label X=ExpEstimate / XERRORLOWER=LowerExp XERRORUPPER=UpperExp
MARKERATTRS=(SYMBOL=Plus COLOR=Cx1b9e77)
ERRORBARATTRS=(THICKNESS=2 COLOR=Cx1b9e77) NOERRORCAPS;
REFLINE 1 / AXIS=X;
XAXIS GRID TYPE=LOG LABEL="Odds Ratio"; /* specify log scale */
YAXIS GRID DISPLAY=(NOLABEL) DISCRETEORDER=DATA REVERSE;
RUN;
TITLE;
| Lower CL | Estimate | Upper CL | |
|---|---|---|---|
| Mexican v Hispanic, Female 30 | 0.7741 | 1.8619 | 4.4785 |
| Mexican v Hispanic, Male 30 | 0.8677 | 2.0123 | 4.6669 |
| Mexican v Hispanic, Female 50 | 1.3438 | 3.9763 | 11.7661 |
| Mexican v Hispanic, Male 50 | 0.2866 | 0.7941 | 2.2007 |
| Mexican v Hispanic, Female 70 | 0.1422 | 0.4814 | 1.6302 |
| Mexican v Hispanic, Male 70 | 0.2636 | 0.9056 | 3.1114 |
The odds ratios for obesity comparing Mexicans to Hispanics are \((0.77,4.48)\) for 30-year-old females, \((1.34,11.7)\) for 50-year-old females, and \((0.14,1.63)\) for 70-year-old females. The odds ratios for obesity comparing Mexicans to Hispanics are \((0.87,4.67)\) for 30-year-old males, \((0.29,2.20)\) for 50-year-old males, and \((0.26,3.11)\) for 70-year-old males.
ODS EXCLUDE ALL;
PROC PLM RESTORE=GenderRaceAge;
ODS OUTPUT Estimates=E;
ESTIMATE "Black v White, Female 30" Race [1, 2] [-1, 1]
Gender*Race [1, 2 2] [-1, 2 1]
Race*AgeCS [1, 2 30] [-1, 1 30]
Gender*Race*AgeCS [1, 2 2 30] [-1, 2 1 30] / EXP CL ALPHA=0.03125;
ESTIMATE "Black v White, Male 30" Race [1, 2] [-1, 1]
Gender*Race [1, 1 2] [-1, 1 1]
Race*AgeCS [1, 2 30] [-1, 1 30]
Gender*Race*AgeCS [1, 1 2 30] [-1, 1 1 30] / EXP CL ALPHA=0.03125;
ESTIMATE "Black v White, Female 50" Race [1, 2] [-1, 1]
Gender*Race [1, 2 2] [-1, 2 1]
Race*AgeCS [1, 2 50] [-1, 1 50]
Gender*Race*AgeCS [1, 2 2 50] [-1, 2 1 50] / EXP CL ALPHA=0.03125;
ESTIMATE "Black v White, Male 50" Race [1, 2] [-1, 1]
Gender*Race [1, 1 2] [-1, 1 1]
Race*AgeCS [1, 2 50] [-1, 1 50]
Gender*Race*AgeCS [1, 1 2 50] [-1, 1 1 50] / EXP CL ALPHA=0.03125;
ESTIMATE "Black v White, Female 70" Race [1, 2] [-1, 1]
Gender*Race [1, 2 2] [-1, 2 1]
Race*AgeCS [1, 2 70] [-1, 1 70]
Gender*Race*AgeCS [1, 2 2 70] [-1, 2 1 70] / EXP CL ALPHA=0.03125;
ESTIMATE "Black v White, Male 70" Race [1, 2] [-1, 1]
Gender*Race [1, 1 2] [-1, 1 1]
Race*AgeCS [1, 2 70] [-1, 1 70]
Gender*Race*AgeCS [1, 1 2 70] [-1, 1 1 70] / EXP CL ALPHA=0.03125;
RUN;
ODS EXCLUDE NONE;
PROC PRINT DATA=E NOOBS LABEL;
VAR Label LowerExp ExpEstimate UpperExp;
LABEL Label='00'x
LowerExp="Lower CL"
ExpEstimate="Estimate"
UpperExp="Upper CL";
TITLE "Odds Ratios for Race by Gender and Age";
RUN;
PROC SGPLOT DATA=E NOAUTOLEGEND;
STYLEATTRS DATACOLORS=(Cx1b9e77 Cxd95f02 Cx7570b3 Cxe7298a Cx66a61e Cxe6ab02 Cxa6761d Cx666666);
SCATTER Y=Label X=ExpEstimate / XERRORLOWER=LowerExp XERRORUPPER=UpperExp
MARKERATTRS=(SYMBOL=Plus COLOR=Cx1b9e77)
ERRORBARATTRS=(THICKNESS=2 COLOR=Cx1b9e77) NOERRORCAPS;
REFLINE 1 / AXIS=X;
XAXIS GRID TYPE=LOG LABEL="Odds Ratio"; /* specify log scale */
YAXIS GRID DISPLAY=(NOLABEL) DISCRETEORDER=DATA REVERSE;
RUN;
TITLE;
| Lower CL | Estimate | Upper CL | |
|---|---|---|---|
| Black v White, Female 30 | 2.7013 | 4.7846 | 8.4745 |
| Black v White, Male 30 | 0.4777 | 0.8614 | 1.5533 |
| Black v White, Female 50 | 1.5222 | 2.5676 | 4.3307 |
| Black v White, Male 50 | 0.4343 | 0.7579 | 1.3225 |
| Black v White, Female 70 | 1.0474 | 1.7483 | 2.9182 |
| Black v White, Male 70 | 0.7321 | 1.3078 | 2.3364 |
The odds ratios for obesity comparing Blacks to Whites are \((2.70,8.47)\) for 30-year-old females, \((1.52,4.33)\) for 50-year-old females, and \((1.05,2.92)\) for 70-year-old females. The odds ratios for obesity comparing Blacks to Whites are \((0.48,1.55)\) for 30-year-old males, \((0.43,1.32)\) for 50-year-old males, and \((0.73,2.34)\) for 70-year-old males.
Prevalence by Race, Gender and Age
ESTIMATE statements can be used to obtain the prevalence of obesity for any combination of race, gender and age. Note that, in the context of the current model, the prevalence of obesity cannot be obtained for any racial group without specifying the age and gender. There is no single prevalence of obesity for Blacks.
More generally, one can only estimate proportions by specifying values for all of the covariates in the regression model.
ODS EXCLUDE ALL;
PROC PLM RESTORE=GenderRaceAge;
ODS OUTPUT Estimates=E;
ESTIMATE "Mexican, Female 30" Intercept 1 Gender [1, 2] Race [1, 3] AgeCS [1, 30]
Gender*Race [1, 2 3] Gender*AgeCS [1, 2 30] Race*AgeCS [1, 3 30]
Gender*Race*AgeCS [1, 2 3 30] / EXP ILINK ALPHA=0.03125;
ESTIMATE "Hispanic, Female 30" Intercept 1 Gender [1, 2] Race [1, 4] AgeCS [1, 30]
Gender*Race [1, 2 4] Gender*AgeCS [1, 2 30] Race*AgeCS [1, 4 30]
Gender*Race*AgeCS [1, 2 4 30] / EXP ILINK ALPHA=0.03125;
ESTIMATE "Mexican, Male 30" Intercept 1 Gender [1, 1] Race [1, 3] AgeCS [1, 30]
Gender*Race [1, 1 3] Gender*AgeCS [1, 1 30] Race*AgeCS [1, 3 30]
Gender*Race*AgeCS [1, 1 3 30] / EXP ILINK ALPHA=0.03125;
ESTIMATE "Hispanic, Male 30" Intercept 1 Gender [1, 1] Race [1, 4] AgeCS [1, 30]
Gender*Race [1, 1 4] Gender*AgeCS [1, 1 30] Race*AgeCS [1, 4 30]
Gender*Race*AgeCS [1, 1 4 30] / EXP ILINK ALPHA=0.03125;
ESTIMATE "Mexican, Female 50" Intercept 1 Gender [1, 2] Race [1, 3] AgeCS [1, 50]
Gender*Race [1, 2 3] Gender*AgeCS [1, 2 50] Race*AgeCS [1, 3 50]
Gender*Race*AgeCS [1, 2 3 50] / EXP ILINK ALPHA=0.03125;
ESTIMATE "Hispanic, Female 50" Intercept 1 Gender [1, 2] Race [1, 4] AgeCS [1, 50]
Gender*Race [1, 2 4] Gender*AgeCS [1, 2 50] Race*AgeCS [1, 4 50]
Gender*Race*AgeCS [1, 2 4 50] / EXP ILINK ALPHA=0.03125;
ESTIMATE "Mexican, Male 50" Intercept 1 Gender [1, 1] Race [1, 3] AgeCS [1, 50]
Gender*Race [1, 1 3] Gender*AgeCS [1, 1 50] Race*AgeCS [1, 3 50]
Gender*Race*AgeCS [1, 1 3 50] / EXP ILINK ALPHA=0.03125;
ESTIMATE "Hispanic, Male 50" Intercept 1 Gender [1, 1] Race [1, 4] AgeCS [1, 50]
Gender*Race [1, 1 4] Gender*AgeCS [1, 1 50] Race*AgeCS [1, 4 50]
Gender*Race*AgeCS [1, 1 4 50] / EXP ILINK ALPHA=0.03125;
RUN;
ODS EXCLUDE NONE;
PROC PRINT DATA=E NOOBS LABEL;
VAR Label LowerMu Mu UpperMu;
LABEL Label='00'x
Mu="Estimate"
LowerMu="Lower CL"
UpperMu="Upper CL";
TITLE "Prevalence of Obesity";
RUN;
TITLE;
| Lower CL | Estimate | Upper CL | |
|---|---|---|---|
| Mexican, Female 30 | 0.3015 | 0.4267 | 0.5620 |
| Hispanic, Female 30 | 0.1673 | 0.2855 | 0.4430 |
| Mexican, Male 30 | 0.3127 | 0.4321 | 0.5600 |
| Hispanic, Male 30 | 0.1627 | 0.2744 | 0.4239 |
| Mexican, Female 50 | 0.4708 | 0.6382 | 0.7776 |
| Hispanic, Female 50 | 0.1605 | 0.3073 | 0.5072 |
| Mexican, Male 50 | 0.2204 | 0.3468 | 0.4993 |
| Hispanic, Male 50 | 0.2308 | 0.4007 | 0.5983 |
The estimated regression functions (log odds of obesity as a function of race, gender and age) can be plotted using PROC SGPLOT.
ODS EXCLUDE ALL;
DATA AllGendersAllRacesAllAges;
DO Gender = 1 TO 2 BY 1;
DO Race = 1 TO 5 BY 1;
DO Age = 18 TO 79 BY 1;
OUTPUT;
END;
END;
END;
FORMAT Gender GenderF.;
FORMAT Race Race1F.;
RUN;
PROC PLM RESTORE=GenderRaceAge;
SCORE DATA=AllGendersAllRacesAllAges OUT=Fitted Predicted LCLM=Lower_Logit UCLM=Upper_Logit / ALPHA=0.03125;
RUN;
DATA Fitted;
SET Fitted;
Lower_Prob = 1/(1+EXP(-Lower_Logit));
Upper_Prob = 1/(1+EXP(-Upper_Logit));
RUN;
ODS EXCLUDE NONE;
PROC SGPANEL DATA=Fitted;
STYLEATTRS DATACOLORS=(Cx1b9e77 Cxd95f02 Cx7570b3 Cxe7298a Cx66a61e Cxe6ab02 Cxa6761d Cx666666);
PANELBY Gender;
BAND X=Age Lower=Lower_Logit Upper=Upper_Logit /
GROUP=Race TRANSPARENCY=0.7;
ROWAXIS LABEL="Logit (Log Odds) of Obesity";
RUN;
Alternatively, the estimated prevalence of obesity can also be plotted using PROC SGPLOT.
PROC SGPANEL DATA=Fitted;
STYLEATTRS DATACOLORS=(Cx1b9e77 Cxd95f02 Cx7570b3 Cxe7298a Cx66a61e Cxe6ab02 Cxa6761d Cx666666);
PANELBY Gender;
BAND X=Age Lower=Lower_Prob Upper=Upper_Prob /
GROUP=Race TRANSPARENCY=0.7;
ROWAXIS LABEL="Prevalence of Obesity";
RUN;
Hypothesis Tests
Likelihood ratio tests can be used to assess the evidence for (1) any effect of race, (2) varying effects of race by gender, (3) varying effects of race by age, and (4) varying effects of race by age and/or gender.
ODS EXCLUDE ALL;
PROC LOGISTIC DATA=Ron;
ODS OUTPUT GlobalTests=Full;
CLASS Gender Race / ORDER=INTERNAL;
EFFECT AgeCS=SPLINE(Age /
NATURALCUBIC BASIS=TPF(NOINT) KNOTMETHOD=PERCENTILELIST(5 23 41 59 77 95));
MODEL Obese(EVENT="Yes") = Gender Race AgeCS
Gender*Race Gender*AgeCS Race*AgeCS
Gender*Race*AgeCS / LINK=LOGIT;
RUN;
PROC LOGISTIC DATA=Ron;
ODS OUTPUT GlobalTests=GenderRace_GenderAge;
CLASS Gender Race / ORDER=INTERNAL;
EFFECT AgeCS=SPLINE(Age /
NATURALCUBIC BASIS=TPF(NOINT) KNOTMETHOD=PERCENTILELIST(5 23 41 59 77 95));
MODEL Obese(EVENT="Yes") = Gender Race AgeCS
Gender*Race Gender*AgeCS / LINK=LOGIT;
RUN;
PROC LOGISTIC DATA=Ron;
ODS OUTPUT GlobalTests=RaceAge_GenderAge;
CLASS Gender Race / ORDER=INTERNAL;
EFFECT AgeCS=SPLINE(Age /
NATURALCUBIC BASIS=TPF(NOINT) KNOTMETHOD=PERCENTILELIST(5 23 41 59 77 95));
MODEL Obese(EVENT="Yes") = Gender Race AgeCS
Gender*AgeCS Race*AgeCS / LINK=LOGIT;
RUN;
PROC LOGISTIC DATA=Ron;
ODS OUTPUT GlobalTests=Race_GenderAge;
CLASS Gender Race / ORDER=INTERNAL;
EFFECT AgeCS=SPLINE(Age /
NATURALCUBIC BASIS=TPF(NOINT) KNOTMETHOD=PERCENTILELIST(5 23 41 59 77 95));
MODEL Obese(EVENT="Yes") = Gender Race AgeCS
Gender*AgeCS / LINK=LOGIT;
RUN;
PROC LOGISTIC DATA=Ron;
ODS OUTPUT GlobalTests=GenderAge;
CLASS Gender Race / ORDER=INTERNAL;
EFFECT AgeCS=SPLINE(Age /
NATURALCUBIC BASIS=TPF(NOINT) KNOTMETHOD=PERCENTILELIST(5 23 41 59 77 95));
MODEL Obese(EVENT="Yes") = Gender AgeCS
Gender*AgeCS / LINK=LOGIT;
RUN;
DATA Full;
SET Full;
WHERE Test="Likelihood Ratio";
RENAME ChiSq=ChiSq_Full DF=DF_Full;
RUN;
DATA GenderRace_GenderAge;
LENGTH Model $50 Test $50;
SET GenderRace_GenderAge;
WHERE Test="Likelihood Ratio";
Model = "Gender*Race Only";
Test = "Race Effects Vary by Age";
KEEP Model Test ChiSq DF;
RUN;
DATA RaceAge_GenderAge;
LENGTH Model $50 Test $50;
SET RaceAge_GenderAge;
WHERE Test="Likelihood Ratio";
Model = "Race*Age Only";
Test = "Race Effects Vary by Gender";
KEEP Model Test ChiSq DF;
RUN;
DATA Race_GenderAge;
LENGTH Model $50 Test $50;
SET Race_GenderAge;
WHERE Test="Likelihood Ratio";
Model = "Race Additive";
Test = "Race Effects Vary by Gender and/or Age";
KEEP Model Test ChiSq DF;
RUN;
DATA GenderAge;
LENGTH Model $50 Test $50;
SET GenderAge;
WHERE Test="Likelihood Ratio";
Model = "Race Omitted";
Test = "Overall Effect of Race";
KEEP Model Test ChiSq DF;
RUN;
DATA LRTests;
LENGTH Model $50 Test $50;
IF _N_ = 1 THEN SET Full;
SET GenderRace_GenderAge RaceAge_GenderAge Race_GenderAge GenderAge;
Chisq = Chisq_Full-Chisq;
DF = DF_Full - DF;
ProbChiSq = EXP(LOGSDF("CHISQUARE",Chisq,DF));
sValue = -LOG(ProbChiSq)/LOG(2);
KEEP Test DF ChiSq ProbChiSq sValue;
RUN;
ODS EXCLUDE NONE;
TITLE "Likelihood Ratio Tests";
PROC PRINT DATA=LRTests NOOBS;
VAR Test DF ChiSq ProbChiSq sValue;
FORMAT sValue 6.2;
FORMAT ChiSq 6.2;
RUN;
TITLE;
| Test | DF | ChiSq | ProbChiSq | sValue |
|---|---|---|---|---|
| Race Effects Vary by Age | 40 | 84.51 | <.0001 | 14.28 |
| Race Effects Vary by Gender | 24 | 91.18 | <.0001 | 30.02 |
| Race Effects Vary by Gender and/or Age | 44 | 131.08 | <.0001 | 32.76 |
| Overall Effect of Race | 48 | 249.26 | <.0001 | 93.85 |
There is overwhelming evidence that the prevalence of obesity varies by race \((p \approx 0, s=94)\). There is very strong evidence of effect modification by age \((p \approx 0, s=14)\) and gender \((p \approx 0, s=30)\).
Initial Model: Additive Model
For illustration, we will revisit the modeling process starting with the additive model.
ODS SELECT NONE;
PROC LOGISTIC DATA=Ron;
* ODS SELECT ResponseProfile;
* ODS SELECT ClassLevelInfo;
* ODS SELECT SplineKnots;
CLASS Gender Race / ORDER=INTERNAL;
EFFECT AgeCS=SPLINE(Age /
NATURALCUBIC BASIS=TPF(NOINT) KNOTMETHOD=PERCENTILELIST(5 23 41 59 77 95));
MODEL Obese(EVENT="Yes") = Gender Race AgeCS / LINK=LOGIT;
OUTPUT OUT=Tmp RESCHI=Residual PREDICTED=Fitted;
STORE GenderRaceAge;
RUN;
ODS SELECT ALL;
Assessing Adequacy of Modeling Assumptions
Before using the model, we must evaluate the adequacy of the modeling assumptions. For logistic regression models, the only assumptions are correct structural model and independence (which cannot be evaluated based on the observed data). The only concern is the potential for inadequacies in the representation of the effect of age, e.g. insufficient df, or the omission of required interaction terms. We can evaluate this assumption using plots of the residuals.
PROC SGPANEL DATA=Tmp;
PANELBY Gender / LAYOUT=ROWLATTICE;
REFLINE 0 / AXIS=Y;
* PBSPLINE X=Age Y=Residual / JITTER; *large samples;
LOESS X=Age Y=Residual / JITTER; *small smaples;
RUN;
PROC SGPANEL DATA=Tmp;
PANELBY Race / LAYOUT=ROWLATTICE;
REFLINE 0 / AXIS=Y;
* PBSPLINE X=Age Y=Residual / JITTER; *large samples;
LOESS X=Age Y=Residual / JITTER; *small smaples;
RUN;
PROC SGPANEL DATA=Tmp;
PANELBY Race Gender;
REFLINE 0 / AXIS=Y;
* PBSPLINE X=Age Y=Residual / JITTER; *large samples;
LOESS X=Age Y=Residual / JITTER; *small smaples;
RUN;
There are some indications of a lack of model fit, particularly the clear trends in the residual plot for Hispanics.
Alternatively, we can use AIC or BIC to formally assess the adequacy of the model fit.
ODS EXCLUDE ALL;
PROC LOGISTIC DATA=Ron;
ODS OUTPUT GlobalTests=DF5;
CLASS Gender Race / ORDER=INTERNAL;
EFFECT AgeCS=SPLINE(Age /
NATURALCUBIC BASIS=TPF(NOINT) KNOTMETHOD=PERCENTILELIST(5 23 41 59 77 95));
MODEL Obese(EVENT="Yes") = Gender Race AgeCS / LINK=LOGIT;
RUN;
PROC LOGISTIC DATA=Ron;
ODS OUTPUT GlobalTests=DF6;
CLASS Gender Race / ORDER=INTERNAL;
EFFECT AgeCS=SPLINE(Age /
NATURALCUBIC BASIS=TPF(NOINT) KNOTMETHOD=PERCENTILELIST(2.5 18.3333 34.1667 50 65.8333 81.6667 97.5));
MODEL Obese(EVENT="Yes") = Gender Race AgeCS / LINK=LOGIT;
RUN;
PROC LOGISTIC DATA=Ron;
ODS OUTPUT GlobalTests=ThreeWay;
CLASS Gender Race / ORDER=INTERNAL;
EFFECT AgeCS=SPLINE(Age /
NATURALCUBIC BASIS=TPF(NOINT) KNOTMETHOD=PERCENTILELIST(5 23 41 59 77 95));
MODEL Obese(EVENT="Yes") = Gender Race AgeCS
Gender*Race Gender*AgeCS Race*AgeCS
Gender*Race*AgeCS / LINK=LOGIT;
RUN;
PROC LOGISTIC DATA=Ron;
ODS OUTPUT GlobalTests=RaceAge_GenderAge;
CLASS Gender Race / ORDER=INTERNAL;
EFFECT AgeCS=SPLINE(Age /
NATURALCUBIC BASIS=TPF(NOINT) KNOTMETHOD=PERCENTILELIST(5 23 41 59 77 95));
MODEL Obese(EVENT="Yes") = Gender Race AgeCS
Gender*Race Race*AgeCS / LINK=LOGIT;
RUN;
PROC LOGISTIC DATA=Ron;
ODS OUTPUT GlobalTests=GenderRace_GenderAge;
CLASS Gender Race / ORDER=INTERNAL;
EFFECT AgeCS=SPLINE(Age /
NATURALCUBIC BASIS=TPF(NOINT) KNOTMETHOD=PERCENTILELIST(5 23 41 59 77 95));
MODEL Obese(EVENT="Yes") = Gender Race AgeCS
Gender*Race Gender*AgeCS / LINK=LOGIT;
RUN;
PROC LOGISTIC DATA=Ron;
ODS OUTPUT GlobalTests=GenderRace_RaceAge;
CLASS Gender Race / ORDER=INTERNAL;
EFFECT AgeCS=SPLINE(Age /
NATURALCUBIC BASIS=TPF(NOINT) KNOTMETHOD=PERCENTILELIST(5 23 41 59 77 95));
MODEL Obese(EVENT="Yes") = Gender Race AgeCS
Gender*Race Race*AgeCS / LINK=LOGIT;
RUN;
PROC LOGISTIC DATA=Ron;
ODS OUTPUT GlobalTests=Race_GenderAge;
CLASS Gender Race / ORDER=INTERNAL;
EFFECT AgeCS=SPLINE(Age /
NATURALCUBIC BASIS=TPF(NOINT) KNOTMETHOD=PERCENTILELIST(5 23 41 59 77 95));
MODEL Obese(EVENT="Yes") = Gender Race AgeCS
Gender*AgeCS / LINK=LOGIT;
RUN;
PROC LOGISTIC DATA=Ron;
ODS OUTPUT GlobalTests=Gender_RaceAge;
CLASS Gender Race / ORDER=INTERNAL;
EFFECT AgeCS=SPLINE(Age /
NATURALCUBIC BASIS=TPF(NOINT) KNOTMETHOD=PERCENTILELIST(5 23 41 59 77 95));
MODEL Obese(EVENT="Yes") = Gender Race AgeCS
Race*AgeCS / LINK=LOGIT;
RUN;
PROC LOGISTIC DATA=Ron;
ODS OUTPUT GlobalTests=Age_GenderRace;
CLASS Gender Race / ORDER=INTERNAL;
EFFECT AgeCS=SPLINE(Age /
NATURALCUBIC BASIS=TPF(NOINT) KNOTMETHOD=PERCENTILELIST(5 23 41 59 77 95));
MODEL Obese(EVENT="Yes") = Gender Race AgeCS
Gender*Race / LINK=LOGIT;
RUN;
DATA DF5;
LENGTH Model $40;
SET DF5;
WHERE Test="Likelihood Ratio";
Model = "Age 5df";
AIC = ChiSq - 2*(DF+1);
BIC = ChiSq - LOG(2592)*(DF+1);
KEEP Model DF AIC BIC;
RUN;
DATA DF6;
LENGTH Model $40;
SET DF6;
WHERE Test="Likelihood Ratio";
Model = "Age 6df";
AIC = ChiSq - 2*(DF+1);
BIC = ChiSq - LOG(2592)*(DF+1);
KEEP Model DF AIC BIC;
RUN;
DATA ThreeWay;
LENGTH Model $40;
SET ThreeWay;
WHERE Test="Likelihood Ratio";
Model = "Gender x Race x Age";
AIC = ChiSq - 2*(DF+1);
BIC = ChiSq - LOG(2592)*(DF+1);
KEEP Model DF AIC BIC;
RUN;
DATA RaceAge_GenderAge;
LENGTH Model $40;
SET RaceAge_GenderAge;
WHERE Test="Likelihood Ratio";
Model = "Gender x Age + Race x Age";
AIC = ChiSq - 2*(DF+1);
BIC = ChiSq - LOG(2592)*(DF+1);
KEEP Model DF AIC BIC;
RUN;
DATA GenderRace_GenderAge;
LENGTH Model $40;
SET GenderRace_GenderAge;
WHERE Test="Likelihood Ratio";
Model = "Gender x Race + Gender x Age";
AIC = ChiSq - 2*(DF+1);
BIC = ChiSq - LOG(2592)*(DF+1);
KEEP Model DF AIC BIC;
RUN;
DATA GenderRace_RaceAge;
LENGTH Model $40;
SET GenderRace_RaceAge;
WHERE Test="Likelihood Ratio";
Model = "Gender x Age + Race x Age";
AIC = ChiSq - 2*(DF+1);
BIC = ChiSq - LOG(2592)*(DF+1);
KEEP Model DF AIC BIC;
RUN;
DATA Race_GenderAge;
LENGTH Model $40;
SET Race_GenderAge;
WHERE Test="Likelihood Ratio";
Model = "Gender x Age";
AIC = ChiSq - 2*(DF+1);
BIC = ChiSq - LOG(2592)*(DF+1);
KEEP Model DF AIC BIC;
RUN;
DATA Gender_RaceAge;
LENGTH Model $40;
SET Gender_RaceAge;
WHERE Test="Likelihood Ratio";
Model = "Race x Age";
AIC = ChiSq - 2*(DF+1);
BIC = ChiSq - LOG(2592)*(DF+1);
KEEP Model DF AIC BIC;
RUN;
DATA Age_GenderRace;
LENGTH Model $40;
SET Age_GenderRace;
WHERE Test="Likelihood Ratio";
Model = "Gender x Race";
AIC = ChiSq - 2*(DF+1);
BIC = ChiSq - LOG(2592)*(DF+1);
KEEP Model DF AIC BIC;
RUN;
DATA ModelComparisons;
LENGTH Model $40;
SET DF5 DF6
ThreeWay RaceAge_GenderAge GenderRace_GenderAge GenderRace_RaceAge
Race_GenderAge Gender_RaceAge Age_GenderRace;
KEEP Model DF AIC BIC;
RUN;
PROC MEANS DATA=ModelComparisons MAX NOPRINT;
VAR AIC BIC;
OUTPUT OUT=MaxIC MAX= / AUTONAME;
RUN;
DATA ModelComparisons;
IF _N_=1 THEN SET MaxIC;
SET ModelComparisons;
AIC = AIC_Max - AIC;
BIC = BIC_Max - BIC;
KEEP Model DF AIC BIC;
RUN;
ODS EXCLUDE NONE;
TITLE "Model Comparisons (AIC/BIC)";
PROC PRINT DATA=ModelComparisons NOOBS;
VAR Model DF AIC BIC;
FORMAT AIC BIC 6.1;
RUN;
TITLE;
| Model | DF | AIC | BIC |
|---|---|---|---|
| Age 5df | 10 | 42.4 | 13.2 |
| Age 6df | 11 | 43.5 | 20.2 |
| Gender x Race x Age | 59 | 0.0 | 257.9 |
| Gender x Age + Race x Age | 34 | 6.5 | 118.0 |
| Gender x Race + Gender x Age | 19 | 4.5 | 28.0 |
| Gender x Age + Race x Age | 34 | 6.5 | 118.0 |
| Gender x Age | 15 | 43.1 | 43.1 |
| Race x Age | 30 | 42.4 | 130.4 |
| Gender x Race | 14 | 5.8 | 0.0 |
Based on BIC, the optimal model is a two-way interaction between gender and race with an additive effect of age. This model is decisively better than our initial model (or any of the other models), so we will move forward using this revised model.
Revised Model: Gender-Race Interaction + Age
ODS SELECT NONE;
PROC LOGISTIC DATA=Ron;
* ODS SELECT ResponseProfile;
* ODS SELECT ClassLevelInfo;
* ODS SELECT SplineKnots;
CLASS Gender Race / ORDER=INTERNAL;
EFFECT AgeCS=SPLINE(Age /
NATURALCUBIC BASIS=TPF(NOINT) KNOTMETHOD=PERCENTILELIST(5 23 41 59 77 95));
MODEL Obese(EVENT="Yes") = Gender Race Gender*Race AgeCS / LINK=LOGIT;
OUTPUT OUT=Tmp RESCHI=Residual PREDICTED=Fitted;
STORE GenderRaceAge;
RUN;
ODS SELECT ALL;
Assessing Adequacy of Modeling Assumptions
Before using the model, we must evaluate the adequacy of the modeling assumptions. For logistic regression models, the only assumptions are correct structural model and independence (which cannot be evaluated based on the observed data). The only concern is the potential for inadequacies in the representation of the effect of age, e.g. insufficient df, or the omission of required interaction terms. We can evaluate this assumption using plots of the residuals.
PROC SGPANEL DATA=Tmp;
PANELBY Gender / LAYOUT=ROWLATTICE;
REFLINE 0 / AXIS=Y;
* PBSPLINE X=Age Y=Residual / JITTER; *large samples;
LOESS X=Age Y=Residual / JITTER; *small smaples;
RUN;
PROC SGPANEL DATA=Tmp;
PANELBY Race / LAYOUT=ROWLATTICE;
REFLINE 0 / AXIS=Y;
* PBSPLINE X=Age Y=Residual / JITTER; *large samples;
LOESS X=Age Y=Residual / JITTER; *small smaples;
RUN;
PROC SGPANEL DATA=Tmp;
PANELBY Race Gender;
REFLINE 0 / AXIS=Y;
* PBSPLINE X=Age Y=Residual / JITTER; *large samples;
LOESS X=Age Y=Residual / JITTER; *small smaples;
RUN;
There are some indications of a lack of model fit, particularly the clear trends in the residual plot for Hispanics.
Alternatively, we can use AIC or BIC to formally assess the adequacy of the model fit.
ODS EXCLUDE ALL;
PROC LOGISTIC DATA=Ron;
ODS OUTPUT GlobalTests=DF5;
CLASS Gender Race / ORDER=INTERNAL;
EFFECT AgeCS=SPLINE(Age /
NATURALCUBIC BASIS=TPF(NOINT) KNOTMETHOD=PERCENTILELIST(5 23 41 59 77 95));
MODEL Obese(EVENT="Yes") = Gender Race Gender*Race AgeCS / LINK=LOGIT;
RUN;
PROC LOGISTIC DATA=Ron;
ODS OUTPUT GlobalTests=DF6;
CLASS Gender Race / ORDER=INTERNAL;
EFFECT AgeCS=SPLINE(Age /
NATURALCUBIC BASIS=TPF(NOINT) KNOTMETHOD=PERCENTILELIST(2.5 18.3333 34.1667 50 65.8333 81.6667 97.5));
MODEL Obese(EVENT="Yes") = Gender Race Gender*Race AgeCS / LINK=LOGIT;
RUN;
PROC LOGISTIC DATA=Ron;
ODS OUTPUT GlobalTests=ThreeWay;
CLASS Gender Race / ORDER=INTERNAL;
EFFECT AgeCS=SPLINE(Age /
NATURALCUBIC BASIS=TPF(NOINT) KNOTMETHOD=PERCENTILELIST(5 23 41 59 77 95));
MODEL Obese(EVENT="Yes") = Gender Race AgeCS
Gender*Race Gender*AgeCS Race*AgeCS
Gender*Race*AgeCS / LINK=LOGIT;
RUN;
PROC LOGISTIC DATA=Ron;
ODS OUTPUT GlobalTests=GenderRace_GenderAge;
CLASS Gender Race / ORDER=INTERNAL;
EFFECT AgeCS=SPLINE(Age /
NATURALCUBIC BASIS=TPF(NOINT) KNOTMETHOD=PERCENTILELIST(5 23 41 59 77 95));
MODEL Obese(EVENT="Yes") = Gender Race Gender*Race AgeCS
Gender*AgeCS / LINK=LOGIT;
RUN;
PROC LOGISTIC DATA=Ron;
ODS OUTPUT GlobalTests=GenderRace_RaceAge;
CLASS Gender Race / ORDER=INTERNAL;
EFFECT AgeCS=SPLINE(Age /
NATURALCUBIC BASIS=TPF(NOINT) KNOTMETHOD=PERCENTILELIST(5 23 41 59 77 95));
MODEL Obese(EVENT="Yes") = Gender Race Gender*Race AgeCS
Race*AgeCS / LINK=LOGIT;
RUN;
DATA DF5;
LENGTH Model $40;
SET DF5;
WHERE Test="Likelihood Ratio";
Model = "Gender x Race + Age 5df";
AIC = ChiSq - 2*(DF+1);
BIC = ChiSq - LOG(2592)*(DF+1);
KEEP Model DF AIC BIC;
RUN;
DATA DF6;
LENGTH Model $40;
SET DF6;
WHERE Test="Likelihood Ratio";
Model = "Gender x Race + Age 6df";
AIC = ChiSq - 2*(DF+1);
BIC = ChiSq - LOG(2592)*(DF+1);
KEEP Model DF AIC BIC;
RUN;
DATA ThreeWay;
LENGTH Model $40;
SET ThreeWay;
WHERE Test="Likelihood Ratio";
Model = "Gender x Race x Age";
AIC = ChiSq - 2*(DF+1);
BIC = ChiSq - LOG(2592)*(DF+1);
KEEP Model DF AIC BIC;
RUN;
DATA GenderRace_GenderAge;
LENGTH Model $40;
SET GenderRace_GenderAge;
WHERE Test="Likelihood Ratio";
Model = "Gender x Race + Gender x Age";
AIC = ChiSq - 2*(DF+1);
BIC = ChiSq - LOG(2592)*(DF+1);
KEEP Model DF AIC BIC;
RUN;
DATA GenderRace_RaceAge;
LENGTH Model $40;
SET GenderRace_RaceAge;
WHERE Test="Likelihood Ratio";
Model = "Gender x Race + Race x Age";
AIC = ChiSq - 2*(DF+1);
BIC = ChiSq - LOG(2592)*(DF+1);
KEEP Model DF AIC BIC;
RUN;
DATA ModelComparisons;
LENGTH Model $40;
SET DF5 DF6
ThreeWay GenderRace_GenderAge GenderRace_RaceAge;
KEEP Model DF AIC BIC;
RUN;
PROC MEANS DATA=ModelComparisons MAX NOPRINT;
VAR AIC BIC;
OUTPUT OUT=MaxIC MAX= / AUTONAME;
RUN;
DATA ModelComparisons;
IF _N_=1 THEN SET MaxIC;
SET ModelComparisons;
AIC = AIC_Max - AIC;
BIC = BIC_Max - BIC;
KEEP Model DF AIC BIC;
RUN;
ODS EXCLUDE NONE;
TITLE "Model Comparisons (AIC/BIC)";
PROC PRINT DATA=ModelComparisons NOOBS;
VAR Model DF AIC BIC;
FORMAT AIC BIC 6.1;
RUN;
TITLE;
| Model | DF | AIC | BIC |
|---|---|---|---|
| Gender x Race + Age 5df | 14 | 5.8 | 0.0 |
| Gender x Race + Age 6df | 15 | 6.9 | 6.9 |
| Gender x Race x Age | 59 | 0.0 | 257.9 |
| Gender x Race + Gender x Age | 19 | 4.5 | 28.0 |
| Gender x Race + Race x Age | 34 | 6.5 | 118.0 |
Based on BIC, our current model, a two-way interaction between gender and race with an additive effect of age, is the optimal model and decisively better than the other models under consideration, so we will proceed with inference.
Odds Ratios for Race by Gender
ESTIMATE statements can be used to find comparisons of any two races for either gender and any fixed age.
ODS EXCLUDE ALL;
PROC PLM RESTORE=GenderRaceAge;
ODS OUTPUT Estimates=E;
ESTIMATE "Black v. White, Female" Race [1, 2] [-1, 1]
Gender*Race [1, 2 2] [-1, 2 1] / CL EXP ALPHA=0.03125;
ESTIMATE "Mexican v. White, Female" Race [1, 3] [-1, 1]
Gender*Race [1, 2 3] [-1, 2 1] / CL EXP ALPHA=0.03125;
ESTIMATE "Hispanic v. White, Female" Race [1, 4] [-1, 1]
Gender*Race [1, 2 4] [-1, 2 1] / CL EXP ALPHA=0.03125;
ESTIMATE "Other v. White, Female" Race [1, 5] [-1, 1]
Gender*Race [1, 2 5] [-1, 2 1] / CL EXP ALPHA=0.03125;
ESTIMATE "Black v. White, Male" Race [1, 2] [-1, 1]
Gender*Race [1, 1 2] [-1, 1 1] / CL EXP ALPHA=0.03125;
ESTIMATE "Mexican v. White, Male" Race [1, 3] [-1, 1]
Gender*Race [1, 1 3] [-1, 1 1] / CL EXP ALPHA=0.03125;
ESTIMATE "Hispanic v. White, Male" Race [1, 4] [-1, 1]
Gender*Race [1, 1 3] [-1, 1 1] / CL EXP ALPHA=0.03125;
ESTIMATE "Other v. White, Male" Race [1, 5] [-1, 1]
Gender*Race [1, 1 5] [-1, 1 1] / CL EXP ALPHA=0.03125;
RUN;
ODS EXCLUDE NONE;
PROC PRINT DATA=E NOOBS LABEL;
VAR Label LowerExp ExpEstimate UpperExp;
LABEL Label='00'x
LowerExp="Lower CL"
ExpEstimate="Estimate"
UpperExp="Upper CL";
TITLE "Odds Ratios for Race by Gender and Age";
RUN;
PROC SGPLOT DATA=E NOAUTOLEGEND;
STYLEATTRS DATACOLORS=(Cx1b9e77 Cxd95f02 Cx7570b3 Cxe7298a Cx66a61e Cxe6ab02 Cxa6761d Cx666666);
SCATTER Y=Label X=ExpEstimate / XERRORLOWER=LowerExp XERRORUPPER=UpperExp
MARKERATTRS=(SYMBOL=Plus COLOR=Cx1b9e77)
ERRORBARATTRS=(THICKNESS=2 COLOR=Cx1b9e77) NOERRORCAPS;
REFLINE 1 / AXIS=X;
XAXIS GRID TYPE=LOG LABEL="Odds Ratio"; /* specify log scale */
YAXIS GRID DISPLAY=(NOLABEL) DISCRETEORDER=DATA REVERSE;
RUN;
TITLE;
| Lower CL | Estimate | Upper CL | |
|---|---|---|---|
| Black v. White, Female | 2.2714 | 2.8560 | 3.5911 |
| Mexican v. White, Female | 1.3065 | 1.7399 | 2.3172 |
| Hispanic v. White, Female | 0.7279 | 1.0093 | 1.3996 |
| Other v. White, Female | 0.3762 | 0.5238 | 0.7294 |
| Black v. White, Male | 0.8507 | 1.0892 | 1.3945 |
| Mexican v. White, Male | 0.9083 | 1.1778 | 1.5272 |
| Hispanic v. White, Male | 0.6639 | 0.8972 | 1.2127 |
| Other v. White, Male | 0.4232 | 0.5858 | 0.8109 |
For females, the odds ratios for obesity (adjusted for age) are \((2.27,3.59)\) comparing Blacks to Whites, \((1.31,2.32)\) comparing Mexicans to Whites, \((0.73,1.40)\) comparing Hispanics to Whites and \((0.38,0.73)\) comparing Other Races to Whites; for males, the odds ratios for obesity (adjusted for age) are \((0.85,1.39)\) comparing Blacks to Whites, \((0.91,1.53)\) comparing Mexicans to Whites, \((0.66,1.21)\) comparing Hispanics to Whites and \((0.42,0.81)\) comparing Other Races to Whites
Prevalence by Race, Gender and Age
ESTIMATE statements can be used to obtain the prevalence of obesity for any combination of race, gender and age. Note that, in the context of the current model, the prevalence of obesity cannot be obtained for any racial group without specifying the age and gender. There is no single prevalence of obesity for Blacks.
ODS EXCLUDE ALL;
PROC PLM RESTORE=GenderRaceAge;
ODS OUTPUT Estimates=E;
ESTIMATE "Mexican, Female 30" Intercept 1 Gender [1, 2] Race [1, 3] AgeCS [1, 30]
Gender*Race [1, 2 3] / EXP ILINK ALPHA=0.03125;
ESTIMATE "Hispanic, Female 30" Intercept 1 Gender [1, 2] Race [1, 4] AgeCS [1, 30]
Gender*Race [1, 2 4] / EXP ILINK ALPHA=0.03125;
ESTIMATE "Mexican, Male 30" Intercept 1 Gender [1, 1] Race [1, 3] AgeCS [1, 30]
Gender*Race [1, 1 3] / EXP ILINK ALPHA=0.03125;
ESTIMATE "Hispanic, Male 30" Intercept 1 Gender [1, 1] Race [1, 4] AgeCS [1, 30]
Gender*Race [1, 1 4] / EXP ILINK ALPHA=0.03125;
ESTIMATE "Mexican, Female 50" Intercept 1 Gender [1, 2] Race [1, 3] AgeCS [1, 50]
Gender*Race [1, 2 3] / EXP ILINK ALPHA=0.03125;
ESTIMATE "Hispanic, Female 50" Intercept 1 Gender [1, 2] Race [1, 4] AgeCS [1, 50]
Gender*Race [1, 2 4] / EXP ILINK ALPHA=0.03125;
ESTIMATE "Mexican, Male 50" Intercept 1 Gender [1, 1] Race [1, 3] AgeCS [1, 50]
Gender*Race [1, 1 3] / EXP ILINK ALPHA=0.03125;
ESTIMATE "Hispanic, Male 50" Intercept 1 Gender [1, 1] Race [1, 4] AgeCS [1, 50]
Gender*Race [1, 1 4] / EXP ILINK ALPHA=0.03125;
RUN;
ODS EXCLUDE NONE;
PROC PRINT DATA=E NOOBS LABEL;
VAR Label LowerMu Mu UpperMu;
LABEL Label='00'x
Mu="Estimate"
LowerMu="Lower CL"
UpperMu="Upper CL";
TITLE "Prevalence of Obesity";
RUN;
TITLE;
| Lower CL | Estimate | Upper CL | |
|---|---|---|---|
| Mexican, Female 30 | 0.3863 | 0.4572 | 0.5298 |
| Hispanic, Female 30 | 0.2597 | 0.3282 | 0.4049 |
| Mexican, Male 30 | 0.3365 | 0.3985 | 0.4639 |
| Hispanic, Male 30 | 0.3221 | 0.3986 | 0.4803 |
| Mexican, Female 50 | 0.3842 | 0.4553 | 0.5282 |
| Hispanic, Female 50 | 0.2578 | 0.3265 | 0.4036 |
| Mexican, Male 50 | 0.3346 | 0.3966 | 0.4621 |
| Hispanic, Male 50 | 0.3199 | 0.3967 | 0.4790 |
The estimated regression functions (log odds of obesity as a function of race, gender and age) can be plotted using PROC SGPLOT.
ODS EXCLUDE ALL;
DATA AllGendersAllRacesAllAges;
DO Gender = 1 TO 2 BY 1;
DO Race = 1 TO 5 BY 1;
DO Age = 18 TO 79 BY 1;
OUTPUT;
END;
END;
END;
FORMAT Gender GenderF.;
FORMAT Race Race1F.;
RUN;
PROC PLM RESTORE=GenderRaceAge;
SCORE DATA=AllGendersAllRacesAllAges OUT=Fitted Predicted LCLM=Lower_Logit UCLM=Upper_Logit / ALPHA=0.03125;
RUN;
DATA Fitted;
SET Fitted;
Lower_Prob = 1/(1+EXP(-Lower_Logit));
Upper_Prob = 1/(1+EXP(-Upper_Logit));
RUN;
ODS EXCLUDE NONE;
PROC SGPANEL DATA=Fitted;
STYLEATTRS DATACOLORS=(Cx1b9e77 Cxd95f02 Cx7570b3 Cxe7298a Cx66a61e Cxe6ab02 Cxa6761d Cx666666);
PANELBY Gender;
BAND X=Age Lower=Lower_Logit Upper=Upper_Logit /
GROUP=Race TRANSPARENCY=0.7;
ROWAXIS LABEL="Logit (Log Odds) of Obesity";
RUN;
Alternatively, the estimated prevalence of obesity can also be plotted using PROC SGPLOT.
PROC SGPANEL DATA=Fitted;
STYLEATTRS DATACOLORS=(Cx1b9e77 Cxd95f02 Cx7570b3 Cxe7298a Cx66a61e Cxe6ab02 Cxa6761d Cx666666);
PANELBY Gender;
BAND X=Age Lower=Lower_Prob Upper=Upper_Prob /
GROUP=Race TRANSPARENCY=0.7;
ROWAXIS LABEL="Prevalence of Obesity";
RUN;
Hypothesis Tests
Likelihood ratio tests can be used to assess the evidence for (1) any effect of race and (2) varying effects of race by gender.
ODS EXCLUDE ALL;
PROC LOGISTIC DATA=Ron;
ODS OUTPUT GlobalTests=Full;
CLASS Gender Race / ORDER=INTERNAL;
EFFECT AgeCS=SPLINE(Age /
NATURALCUBIC BASIS=TPF(NOINT) KNOTMETHOD=PERCENTILELIST(5 23 41 59 77 95));
MODEL Obese(EVENT="Yes") = Gender Race AgeCS
Gender*Race / LINK=LOGIT;
RUN;
PROC LOGISTIC DATA=Ron;
ODS OUTPUT GlobalTests=Additive;
CLASS Gender Race / ORDER=INTERNAL;
EFFECT AgeCS=SPLINE(Age /
NATURALCUBIC BASIS=TPF(NOINT) KNOTMETHOD=PERCENTILELIST(5 23 41 59 77 95));
MODEL Obese(EVENT="Yes") = Gender Race AgeCS / LINK=LOGIT;
RUN;
PROC LOGISTIC DATA=Ron;
ODS OUTPUT GlobalTests=GenderAge;
CLASS Gender Race / ORDER=INTERNAL;
EFFECT AgeCS=SPLINE(Age /
NATURALCUBIC BASIS=TPF(NOINT) KNOTMETHOD=PERCENTILELIST(5 23 41 59 77 95));
MODEL Obese(EVENT="Yes") = Gender AgeCS / LINK=LOGIT;
RUN;
DATA Full;
SET Full;
WHERE Test="Likelihood Ratio";
RENAME ChiSq=ChiSq_Full DF=DF_Full;
RUN;
DATA Additive;
LENGTH Model $50 Test $50;
SET Additive;
WHERE Test="Likelihood Ratio";
Model = "Additive";
Test = "Race Effects Vary by Gender";
KEEP Model Test ChiSq DF;
RUN;
DATA GenderAge;
LENGTH Model $50 Test $50;
SET GenderAge;
WHERE Test="Likelihood Ratio";
Model = "Race Omitted";
Test = "Overall Effect of Race";
KEEP Model Test ChiSq DF;
RUN;
DATA LRTests;
LENGTH Model $50 Test $50;
IF _N_ = 1 THEN SET Full;
SET Additive GenderAge;
Chisq = Chisq_Full-Chisq;
DF = DF_Full - DF;
ProbChiSq = EXP(LOGSDF("CHISQUARE",Chisq,DF));
sValue = -LOG(ProbChiSq)/LOG(2);
KEEP Test DF ChiSq ProbChiSq sValue;
RUN;
ODS EXCLUDE NONE;
TITLE "Likelihood Ratio Tests";
PROC PRINT DATA=LRTests NOOBS;
VAR Test DF ChiSq ProbChiSq sValue;
FORMAT sValue 6.2;
FORMAT ChiSq 6.2;
RUN;
TITLE;
| Test | DF | ChiSq | ProbChiSq | sValue |
|---|---|---|---|---|
| Race Effects Vary by Gender | 4 | 44.64 | <.0001 | 27.66 |
| Overall Effect of Race | 8 | 162.82 | <.0001 | 100.94 |
There is overwhelming evidence that the prevalence of obesity varies by race \((p \approx 0, s=101)\). There is very strong evidence that the race differences in the log odds of obesity vary by gender \((p \approx 0, s=28)\).
Recommended Practice
LIBNAME NHANES "../../../BMI552_Datasets/NHANES_May2023/";
*LIBNAME NHANES "~/my_shared_file_links/u48718377/NHANES";
OPTIONS FMTSEARCH=(NHANES.formats_NHANES work) NODATE NONUMBER;
TITLE1; TITLE2;
ODS NOPROCTITLE;
ODS GRAPHICS / LOESSMAXOBS=20000;
DATA Ron;
SET NHANES.NHANES (RENAME=(Race1=Race));
RUN;
ODS SELECT NONE;
PROC LOGISTIC DATA=Ron;
CLASS Gender Race / ORDER=INTERNAL;
EFFECT AgeCS=SPLINE(Age /
NATURALCUBIC BASIS=TPF(NOINT) KNOTMETHOD=PERCENTILELIST(5 27.5 50 72.5 95));
MODEL Obese(EVENT="Yes") = Gender Race AgeCS / LINK=LOGIT;
STORE Initial;
RUN;
ODS SELECT ALL;
For model checking, we would recommend comparisons with a model with one additional df for age and the three models with exactly one of the two-way interaction using BIC.
TITLE "Additive";
PROC LOGISTIC DATA=Ron;
ODS SELECT ResponseProfile FitStatistics;
CLASS Gender Race / ORDER=INTERNAL;
EFFECT AgeCS=SPLINE(Age /
NATURALCUBIC BASIS=TPF(NOINT) KNOTMETHOD=PERCENTILELIST(5 27.5 50 72.5 95));
MODEL Obese(EVENT="Yes") = Gender Race AgeCS / LINK=LOGIT;
RUN;
TITLE;
TITLE "Interaction - Gender*Race";
PROC LOGISTIC DATA=Ron;
ODS SELECT FitStatistics;
CLASS Gender Race / ORDER=INTERNAL;
EFFECT AgeCS=SPLINE(Age /
NATURALCUBIC BASIS=TPF(NOINT) KNOTMETHOD=PERCENTILELIST(5 27.5 50 72.5 95));
MODEL Obese(EVENT="Yes") = Gender Race AgeCS Gender*Race / LINK=LOGIT;
RUN;
TITLE;
TITLE "Interaction - Gender*Age";
PROC LOGISTIC DATA=Ron;
ODS SELECT FitStatistics;
CLASS Gender Race / ORDER=INTERNAL;
EFFECT AgeCS=SPLINE(Age /
NATURALCUBIC BASIS=TPF(NOINT) KNOTMETHOD=PERCENTILELIST(5 27.5 50 72.5 95));
MODEL Obese(EVENT="Yes") = Gender Race AgeCS Gender*AgeCS / LINK=LOGIT;
RUN;
TITLE;
TITLE "Interaction - Race*Age";
PROC LOGISTIC DATA=Ron;
ODS SELECT FitStatistics;
CLASS Gender Race / ORDER=INTERNAL;
EFFECT AgeCS=SPLINE(Age /
NATURALCUBIC BASIS=TPF(NOINT) KNOTMETHOD=PERCENTILELIST(5 27.5 50 72.5 95));
MODEL Obese(EVENT="Yes") = Gender Race AgeCS Race*AgeCS / LINK=LOGIT;
RUN;
TITLE;
TITLE "Age +1df";
PROC LOGISTIC DATA=Ron;
ODS SELECT FitStatistics;
CLASS Gender Race / ORDER=INTERNAL;
EFFECT AgeCS=SPLINE(Age /
NATURALCUBIC BASIS=TPF(NOINT) KNOTMETHOD=PERCENTILELIST(5 23 41 59 77 95));
MODEL Obese(EVENT="Yes") = Gender Race AgeCS / LINK=LOGIT;
RUN;
TITLE;RUN;
TITLE;
| Response Profile | ||
|---|---|---|
|
Ordered Value |
Obese |
Total Frequency |
| 1 | No | 4643 |
| 2 | Yes | 2592 |
Probability modeled is Obese='Yes'.
| Model Fit Statistics | ||
|---|---|---|
| Criterion | Intercept Only | Intercept and Covariates |
| AIC | 9442.367 | 9292.114 |
| SC | 9449.253 | 9360.981 |
| -2 Log L | 9440.367 | 9272.114 |
| Model Fit Statistics | ||
|---|---|---|
| Criterion | Intercept Only | Intercept and Covariates |
| AIC | 9442.367 | 9255.557 |
| SC | 9449.253 | 9351.971 |
| -2 Log L | 9440.367 | 9227.557 |
| Model Fit Statistics | ||
|---|---|---|
| Criterion | Intercept Only | Intercept and Covariates |
| AIC | 9442.367 | 9291.520 |
| SC | 9449.253 | 9387.934 |
| -2 Log L | 9440.367 | 9263.520 |
| Model Fit Statistics | ||
|---|---|---|
| Criterion | Intercept Only | Intercept and Covariates |
| AIC | 9442.367 | 9291.016 |
| SC | 9449.253 | 9470.070 |
| -2 Log L | 9440.367 | 9239.016 |
| Model Fit Statistics | ||
|---|---|---|
| Criterion | Intercept Only | Intercept and Covariates |
| AIC | 9442.367 | 9292.541 |
| SC | 9449.253 | 9368.294 |
| -2 Log L | 9440.367 | 9270.541 |
DATA Calculator;
BIC_Initial = 9272.114 + LOG(2592)*10;
BIC_GxR = 9227.557 + LOG(2592)*14;
Delta_BIC = BIC_GxR - BIC_Initial;
KEEP BIC_Initial BIC_GxR Delta_BIC;
RUN;
TITLE "BIC Calculator";
PROC PRINT DATA=Calculator NOOBS;
RUN;
TITLE;
DATA Calculator;
BIC_Initial = 9272.114 + LOG(2592)*10;
BIC_GxA = 9263.520 + LOG(2592)*14;
Delta_BIC = BIC_GxA - BIC_Initial;
KEEP BIC_Initial BIC_GxA Delta_BIC;
RUN;
TITLE "BIC Calculator";
PROC PRINT DATA=Calculator NOOBS;
RUN;
TITLE;
DATA Calculator;
BIC_Initial = 9272.114 + LOG(2592)*10;
BIC_RxA = 9239.016 + LOG(2592)*26;
Delta_BIC = BIC_RxA - BIC_Initial;
KEEP BIC_Initial BIC_RxA Delta_BIC;
RUN;
TITLE "BIC Calculator";
PROC PRINT DATA=Calculator NOOBS;
RUN;
TITLE;
DATA Calculator;
BIC_Initial = 9272.114 + LOG(2592)*10;
BIC_Age5df = 9270.541 + LOG(2592)*11;
Delta_BIC = BIC_Age5df - BIC_Initial;
KEEP BIC_Initial BIC_Age5df Delta_BIC;
RUN;
TITLE "BIC Calculator";
PROC PRINT DATA=Calculator NOOBS;
RUN;
TITLE;
| BIC_Initial | BIC_GxR | Delta_BIC |
|---|---|---|
| 9350.72 | 9337.60 | -13.1163 |
| BIC_Initial | BIC_GxA | Delta_BIC |
|---|---|---|
| 9350.72 | 9373.56 | 22.8467 |
| BIC_Initial | BIC_RxA | Delta_BIC |
|---|---|---|
| 9350.72 | 9443.38 | 92.6650 |
| BIC_Initial | BIC_Age5df | Delta_BIC |
|---|---|---|
| 9350.72 | 9357.00 | 6.28719 |
Based on BIC, the model with a two-way interaction between gender and race is decisively better than the initial model. So, we restart the model building process starting with that model.
ODS SELECT NONE;
PROC LOGISTIC DATA=Ron;
CLASS Gender Race / ORDER=INTERNAL;
EFFECT AgeCS=SPLINE(Age /
NATURALCUBIC BASIS=TPF(NOINT) KNOTMETHOD=PERCENTILELIST(5 27.5 50 72.5 95));
MODEL Obese(EVENT="Yes") = Gender Race AgeCS Gender*Race / LINK=LOGIT;
STORE Revised;
RUN;
ODS SELECT ALL;
For model checking, we would recommend comparisons with a model with one additional df for age and the two models with one additional two-way interaction using BIC.
TITLE "Revised";
PROC LOGISTIC DATA=Ron;
ODS SELECT FitStatistics;
CLASS Gender Race / ORDER=INTERNAL;
EFFECT AgeCS=SPLINE(Age /
NATURALCUBIC BASIS=TPF(NOINT) KNOTMETHOD=PERCENTILELIST(5 27.5 50 72.5 95));
MODEL Obese(EVENT="Yes") = Gender Race AgeCS Gender*Race / LINK=LOGIT;
RUN;
TITLE;
TITLE "Additional Interaction - Gender*Age";
PROC LOGISTIC DATA=Ron;
ODS SELECT FitStatistics;
CLASS Gender Race / ORDER=INTERNAL;
EFFECT AgeCS=SPLINE(Age /
NATURALCUBIC BASIS=TPF(NOINT) KNOTMETHOD=PERCENTILELIST(5 27.5 50 72.5 95));
MODEL Obese(EVENT="Yes") = Gender Race AgeCS Gender*Race Gender*AgeCS / LINK=LOGIT;
RUN;
TITLE;
TITLE "Additional Interaction - Race*Age";
PROC LOGISTIC DATA=Ron;
ODS SELECT FitStatistics;
CLASS Gender Race / ORDER=INTERNAL;
EFFECT AgeCS=SPLINE(Age /
NATURALCUBIC BASIS=TPF(NOINT) KNOTMETHOD=PERCENTILELIST(5 27.5 50 72.5 95));
MODEL Obese(EVENT="Yes") = Gender Race AgeCS Gender*Race Race*AgeCS / LINK=LOGIT;
RUN;
TITLE;
TITLE "Age +1df";
PROC LOGISTIC DATA=Ron;
ODS SELECT FitStatistics;
CLASS Gender Race / ORDER=INTERNAL;
EFFECT AgeCS=SPLINE(Age /
NATURALCUBIC BASIS=TPF(NOINT) KNOTMETHOD=PERCENTILELIST(5 23 41 59 77 95));
MODEL Obese(EVENT="Yes") = Gender Race Gender*Race AgeCS / LINK=LOGIT;
RUN;
TITLE;RUN;
TITLE;
| Model Fit Statistics | ||
|---|---|---|
| Criterion | Intercept Only | Intercept and Covariates |
| AIC | 9442.367 | 9255.557 |
| SC | 9449.253 | 9351.971 |
| -2 Log L | 9440.367 | 9227.557 |
| Model Fit Statistics | ||
|---|---|---|
| Criterion | Intercept Only | Intercept and Covariates |
| AIC | 9442.367 | 9253.042 |
| SC | 9449.253 | 9377.002 |
| -2 Log L | 9440.367 | 9217.042 |
| Model Fit Statistics | ||
|---|---|---|
| Criterion | Intercept Only | Intercept and Covariates |
| AIC | 9442.367 | 9255.203 |
| SC | 9449.253 | 9461.803 |
| -2 Log L | 9440.367 | 9195.203 |
| Model Fit Statistics | ||
|---|---|---|
| Criterion | Intercept Only | Intercept and Covariates |
| AIC | 9442.367 | 9255.904 |
| SC | 9449.253 | 9359.204 |
| -2 Log L | 9440.367 | 9225.904 |
DATA Calculator;
BIC_Revised = 9227.555 + LOG(2592)*14;
BIC_GxA = 9217.042 + LOG(2592)*18;
Delta_BIC = BIC_GxA - BIC_Revised;
KEEP BIC_Revised BIC_GxA Delta_BIC;
RUN;
TITLE "BIC Calculator";
PROC PRINT DATA=Calculator NOOBS;
RUN;
TITLE;
DATA Calculator;
BIC_Revised = 9227.555 + LOG(2592)*14;
BIC_RxA = 9195.203 + LOG(2592)*30;
Delta_BIC = BIC_RxA - BIC_Revised;
KEEP BIC_Revised BIC_RxA Delta_BIC;
RUN;
TITLE "BIC Calculator";
PROC PRINT DATA=Calculator NOOBS;
RUN;
TITLE;
DATA Calculator;
BIC_Revised = 9227.555 + LOG(2592)*14;
BIC_Age5df = 9225.904 + LOG(2592)*15;
Delta_BIC = BIC_Age5df - BIC_Revised;
KEEP BIC_Revised BIC_Age5df Delta_BIC;
RUN;
TITLE "BIC Calculator";
PROC PRINT DATA=Calculator NOOBS;
RUN;
TITLE;
| BIC_Revised | BIC_GxA | Delta_BIC |
|---|---|---|
| 9337.60 | 9358.53 | 20.9277 |
| BIC_Revised | BIC_RxA | Delta_BIC |
|---|---|---|
| 9337.60 | 9431.01 | 93.4110 |
| BIC_Revised | BIC_Age5df | Delta_BIC |
|---|---|---|
| 9337.60 | 9343.81 | 6.20919 |
Based on BIC, there are no concerns about the adequacy of the revised model.
If (partial) likelihood ratio tests required, you should fit the relevant models to get the log-likelihoods and calculate the test statistics (and \(p\)-values) by hand.
TITLE "Revised";
PROC LOGISTIC DATA=Ron;
ODS SELECT FitStatistics;
CLASS Gender Race / ORDER=INTERNAL;
EFFECT AgeCS=SPLINE(Age /
NATURALCUBIC BASIS=TPF(NOINT) KNOTMETHOD=PERCENTILELIST(5 27.5 50 72.5 95));
MODEL Obese(EVENT="Yes") = Gender Race AgeCS Gender*Race / LINK=LOGIT;
RUN;
TITLE;
TITLE "Additive";
PROC LOGISTIC DATA=Ron;
ODS SELECT FitStatistics;
CLASS Gender Race / ORDER=INTERNAL;
EFFECT AgeCS=SPLINE(Age /
NATURALCUBIC BASIS=TPF(NOINT) KNOTMETHOD=PERCENTILELIST(5 27.5 50 72.5 95));
MODEL Obese(EVENT="Yes") = Gender Race AgeCS / LINK=LOGIT;
RUN;
TITLE;
TITLE "Race Dropped";
PROC LOGISTIC DATA=Ron;
ODS SELECT FitStatistics;
CLASS Gender Race / ORDER=INTERNAL;
EFFECT AgeCS=SPLINE(Age /
NATURALCUBIC BASIS=TPF(NOINT) KNOTMETHOD=PERCENTILELIST(5 27.5 50 72.5 95));
MODEL Obese(EVENT="Yes") = Gender AgeCS / LINK=LOGIT;
RUN;
TITLE;
| Model Fit Statistics | ||
|---|---|---|
| Criterion | Intercept Only | Intercept and Covariates |
| AIC | 9442.367 | 9255.557 |
| SC | 9449.253 | 9351.971 |
| -2 Log L | 9440.367 | 9227.557 |
| Model Fit Statistics | ||
|---|---|---|
| Criterion | Intercept Only | Intercept and Covariates |
| AIC | 9442.367 | 9292.114 |
| SC | 9449.253 | 9360.981 |
| -2 Log L | 9440.367 | 9272.114 |
| Model Fit Statistics | ||
|---|---|---|
| Criterion | Intercept Only | Intercept and Covariates |
| AIC | 9442.367 | 9402.207 |
| SC | 9449.253 | 9443.527 |
| -2 Log L | 9440.367 | 9390.207 |
DATA Calculator;
Minus2LL0 = 9272.114;
Minus2LL1 = 9227.555;
df0 = 10;
df1 = 14;
Chisq = Minus2LL0 - Minus2LL1;
df = df1 - df0;
ProbChiSq = SDF("CHISQURE",Chisq,df);
sValue = -LOG(ProbChiSq)/LOG(2);
KEEP Chisq df ProbChiSq sValue;
RUN;
TITLE "Gender*Race Interaction";
PROC PRINT DATA=Calculator NOOBS;
RUN;
TITLE;
DATA Calculator;
Minus2LL0 = 9390.207;
Minus2LL1 = 9227.555;
df0 = 6;
df1 = 14;
Chisq = Minus2LL0 - Minus2LL1;
df = df1 - df0;
ProbChiSq = SDF("CHISQURE",Chisq,df);
sValue = -LOG(ProbChiSq)/LOG(2);
KEEP Chisq df ProbChiSq sValue;
RUN;
TITLE "Race Overall";
PROC PRINT DATA=Calculator NOOBS;
RUN;
TITLE;
| Chisq | df | ProbChiSq | sValue |
|---|---|---|---|
| 44.559 | 4 | 4.9103E-9 | 27.6015 |
| Chisq | df | ProbChiSq | sValue |
|---|---|---|---|
| 162.652 | 8 | 4.4588E-31 | 100.823 |
The likelihood ratio test statistic for the interaction between gender and race is \(44.56 = 9272.114-9227.555\) on 4 df, the \(p\)-value is \(4.9 \times {10}^{-9} \approx 0\), and the \(s\)-value is \(27.6\). The likelihood ratio test statistic for any effects of race is \(162.65 = 9390.207-9227.555\) on 8 df, the \(p\)-value is \(2.5 \times {10}^{-31} \approx 0\), and the \(s\)-value is \(100.8\).
ODS EXCLUDE ALL;
PROC PLM RESTORE=Revised;
ODS OUTPUT Estimates=E;
ESTIMATE "Age 30 v 50" AgeCS [1, 30] [-1, 50] / CL EXP ALPHA=0.03125;
ESTIMATE "Age 40 v 50" AgeCS [1, 40] [-1, 50] / CL EXP ALPHA=0.03125;
ESTIMATE "Age 60 v 50" AgeCS [1, 60] [-1, 50] / CL EXP ALPHA=0.03125;
ESTIMATE "Age 70 v 50" AgeCS [1, 70] [-1, 50] / CL EXP ALPHA=0.03125;
ESTIMATE "Age 80 v 50" AgeCS [1, 80] [-1, 50] / CL EXP ALPHA=0.03125;
RUN;
ODS EXCLUDE NONE;
PROC PRINT DATA=E NOOBS LABEL;
VAR Label LowerExp ExpEstimate UpperExp;
LABEL Label='00'x
LowerExp="Lower CL"
ExpEstimate="Estimate"
UpperExp="Upper CL";
TITLE "Odds Ratios (Main Effect of Age)";
RUN;
TITLE;
| Lower CL | Estimate | Upper CL | |
|---|---|---|---|
| Age 30 v 50 | 0.8365 | 0.9943 | 1.1820 |
| Age 40 v 50 | 0.9484 | 1.0523 | 1.1675 |
| Age 60 v 50 | 1.0565 | 1.2120 | 1.3902 |
| Age 70 v 50 | 0.9217 | 1.0752 | 1.2542 |
| Age 80 v 50 | 0.6263 | 0.7838 | 0.9809 |